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Abstract 

The implementation of a proof-of-concept Lattice Quantum Chromodynamics kernel on the Cell processor is described 
in detail, illustrating issues encountered in the porting process. The resulting code performs up to 45GFlop/s per 
socket, indicating that the Cell processor is likely to be a good platform for future Lattice QCD calculations. 
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1. Introduction 

The Cell processor is a heterogeneous multi-core 
processor [T], originally designed for high-bandwidth 
media applications, but also having potential appli- 
cations in traditional HPC fields such as computa- 
tional science, fnterest in computational science on 
the Cell processor |2)3|4j is driven by its high peak 
floating point capability of up to 256GFlop/s in sin- 
gle precision. Cell-based systems are available in a 
blade server form factor, making possible massively 
parallel machines such as the RoadRunner petaflop 
supercomputer currently under construction [5]. 

In order to exploit the power of the Cell hardware, 
it is necessary to overcome challenges associated 
with the programming model, particularly the ways 
in which it differs from conventional processors. The 
greatest difference is the Cell memory model, which 
replaces caches with software-managed local stores 
and DMA engines. This leads to a requirement for 
explicit communication between processing cores 
and main memory not found in conventional multi- 
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core programs. Another significant obstacle to writ- 
ing code for the Cell or porting code to it is the 
SIMD instruction set used, which must be taken 
into account in order to obtain good floating point 
performance. To some extent this difficultly may be 
mitigated by development tools, but as explained 
in section |2.4| hand-optimised Cell-specific code is 
needed at present. 

Since the details of the Cell architecture have been 
well explained in other publications[l B], here we 
provide only a brief summary. The Cell processor 
consists of nine cores: one Power Processing Element 
(PPE) and eight Synergistic Processing Elements 
(SPE). The PPE is a conventional PowerPC pro- 
cessor, including a 512kB L2 cache. The SPEs are 
SIMD processors, optimised to operate on 128 bit 
vectors. Each SPE has a 256kB local store (LS) and 
associated DMA engine with access to main memory 
as well as the LS of other SPEs. The fabric connect- 
ing processing elements is a bus with a peak band- 
width of 204.8GB/S, with 25.6GB/s ports for each 
processing element and for the memory controller. 
The high bandwidth of the bus leaves the bandwidth 
to the main memory controller as the bottleneck in 
data movement. 
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The vast majority of floating point power lies in 
the SPEs. In the best case, each of the 8 SPEs is ca- 
pable of performing one 128 bit fused multiply-add 
operation per cycle, corresponding to eight single 
precision floating point operations, rendering a peak 
performance of 25.6GFlop/s per SPE on a 3.2GHz 
processor. Any code which is to perform well on the 
Cell processor must scale well enough to use all of 
the SPEs, and be written with a view to SIMD op- 
erations. Although not all applications can satisfy 
these criteria, those based on linear algebra opera- 
tions tend to map well to parallel systems and to 
SIMD instruction sets. 

To illustrate the issues associated with obtaining 
high performance on the Cell processor, this paper 
describes the process of porting a computational ker- 
nel from Lattice Quantum Chromodynamics (Lat- 
tice QCD). Lattice QCD simulations are extremely 
computationally demanding, to such a degree that 
massively parallel machines have been custom-built 
solely to run such calculations [2lSj. This paper is 
limited in scope to programs on a single Cell ma- 
chine, but demonstrating the per-Cell performance 
possible for Lattice QCD is an important step to- 
ward further work on a multi-node Cell system. 



2. Implementation 

Prior to the work described here, various methods 
for Cell programming were experiented with by the 
authors [3]. The implementation presented here ig- 
nores issues of ease-of-programming and maintain- 
ability, in favor of seeking the maximum possible 
performance. 

The SPE management library libspe2[T0|. in- 
cluded in the Cell SDK, provides basic threading 
functionality loading code onto an SPE and execut- 
ing it. All synchronisation and data movement is 
left up to the programmer. This low level approach 
provides great flexibility, but also demands rela- 
tively complex code to manage the SPEs, as well 
as complex code in the SPE programs to perform 
DMA operations and synchronisation as well as the 
actual task underway. 

Creating a SPE thread using libspe2 requires 
only a few straightforward function calls. Firstly, 
a context is created (spe .context .create), where 
a context represents one physical SPE. Next, 
a SPE binary is loaded (spe_program_load) to 
the context, and flnally the SPE program is run 
(spe_context_run). The SPE program is compiled 



with a separate toolchain to the PPE program, and 
then embedded in the PPE executable. 

Although it is possible to implement Cell- 
accelerated versions of each stage in the linear alge- 
bra procedure, and call them sequentially from ex- 
isting procedures, this approach cannot yield good 
performance, since the bandwidth to main mem- 
ory is vastly outstripped by the combined floating 
point power of the SPEs. To make the most of the 
floating point power of the SPEs, it is necessary to 
persistently distribute the dataset across the SPEs. 

2.1. Wilson-Dslash 

Lattice QCD is a discretised form of a fleld theory, 
calculated on a 4D space-time lattice. Here, we refer 
to the spatial size of the lattice as L and the size 
in time as T, such that the lattice has L^T sites in 
total. 

A common benchmark which forms the most com- 
putationally intensive part of lattice QCD calcula- 
tions is the the inversion of the Dirac operator, a 
procedure known as Dslash |11I12| . In this paper, 
the physical or mathematical meaning of these oper- 
ations is not discussed: all that is done is to directly 
transform an existing scalar Dslash procedure into 
a Cell-optimised form. A special-case Dslash rou- 
tine is presented which operates only in single pre- 
cision, only on 4 dimensional systems, and only on 
lattice data laid out in a red-black checkerboard. We 
base our Cell-optimised Dslash on a reference imple- 
mentation provided in the freely available QDP-I— H 
package|I3j. In this reference implementation, oper- 
ations between neighbouring lattice sites are repre- 
sented by operations between relatively shifted in- 
stances of the lattice data. To write the Cell version 
of the code, all these operations are unrolled into 
explicit calculations and DMA operations. 

The Dslash procedure takes as input a spinor Psi 
(a 4 X 3 complex array at each lattice site) , and four 
color matrices u (3 x 3 complex arrays at each 
lattice site), one for each dimension. The output is 
the spinor Chi. Intermediate storage is needed for 
half -spinors (a 2 x 3 complex array at each lattice 
site), which represent the so-called projected form 
of Psi. 

2.2. Data arrangement 

A one-dimensional block decomposition over the 
SPEs (of which we have A^spe in total) in the time 



dimension is used to divide the lattice into a T/Nspe 
thick shce stored in the LS of each SPE. This is 
inherently load-balanced as long as T is a multiple 
of the number of SPEs. This is not a problematic 
requirement, since existing lattice QCD simulations 
frequently use power-of-two sizes. For convenience, 

let Tchunk = T/NsPE- 

The operands to the Dslash operation are contigu- 
ous in main memory when using the default QDPH — h 
red-black checkerboard indexing scheme, wherein 
the slowest varying index is the checkerboard (0 or 
1), and the next slowest varying index is the time 
coordinate. The calculation of an element of Chi on 
the SPE requires the surrounding Psi values. A se- 
ries of index manipulations are performed to locate 
these values in the transferred buffers, which does 
not need to be particularly efficient, since it is done 
once at initialisation, and pointers to the data re- 
quired for each Chi element are stored. This look-up 
table sacrifices local store space for computational 
efficiency and the development time saved by not 
having to optimise the index manipulations. 

The operands are transferred to and from the SPE 
local stores using the SPE DMA engines. Issuing 
DMA instructions is a straightforward low-overhead 
operation, exposed by the SDK as C intrinsics|14j. 
The addresses in main memory from which operands 
are transferred are contained in a structure which is 
loaded by the SPE threads from an address passed 
at initialisation by the PPE thread. 

A beneficial characteristic of the Dslash proce- 
dure is that it is typically used repeatedly for the 
same value of the gauge field U, meaning that the 
U data need only be transferred once. As such, the 
time taken to transfer U is negligible when the pro- 
cedure is run many times and hence the data trans- 
ferred at each Dslash call is limited to the Psi input 
and Chi output. There is some scope for hiding com- 
munications time by double buffering, that is per- 
forming some calculation before Psi is completely 
loaded. This is not done here, but ad-hoc measure- 
ments of communications time vs. calculation time 
indicate that any gains from double buffering would 
not drastically alter the final performance result. 



2.3. Padding 

Since a color matrix (a 3 x 3 complex matrix) has 
size 72 bytes in single precision, if the first element 
in an array of color matrices is aligned to a 16B 
boundary then the subsequent element would be off- 



set by 8 bytes, (72% 16 = 8). This causes a problem 
for SIMD operations, since the SPE load and store 
instructions only work on 16 byte boundaries. To 
solve this, one may either check the alignment of a 
color matrix prior to operating on it, and copy it out 
to an aligned location if necessary, or one may pad 
each element by 8 bytes. 

Since the color matrices are loaded only once at 
initialisation, the cost of padding the color matrix 
arrays on the SPE is negligible when Dslash is called 
many times. Conversely, the cost of making aligned 
copies when operating on the color matrix is in- 
curred at each invocation of the Dslash procedure. 
The padding clearly incurs a penalty in the local 
store space required: an increase from 72 to 80 bytes 
per color matrix, or 11%. The color matrices consti- 
tute the majority of the local store space used for 
data, so this increase is significant to the overall lo- 
cal store requirement. This is only a worthwhile sac- 
rifice as long as the serial performance of operations 
on the SPE is a bottleneck: if the procedure were 
communications-limited then it might be preferable 
to make per-operation aligned temporaries of an un- 
padded array and thus accommodate a larger sub- 
lattice. 



2.4. SIMD arithmetic 

SIMD-aware code is necessary to obtain even 
reasonable performance on the SPE, because SPE 
load/store instructions operate on 16 byte vectors, 
and only at 16 byte aligned locations in the local 
store. To load a vector spanning a 16 byte bound- 
ary, two vector loads are required, followed by a 
shuffle operation to compose the desired vector in a 
register. This is vastly slower than loading a prop- 
erly aligned vector, which is accomplished in one 
load instruction (6 cycles). Similarly, to perform 
arithmetic operations on a scalar value and extract 
a scalar result requires a series of shuffle operations 
in addition to the arithmetic instruction. 

SIMD operations are exposed as instrinsic func- 
tions in the SPE C-I--I- compiler[I4 , operating on 
16 byte vector datatypes. These allow low-level 
SIMD programming without the need to write as- 
sembly language. To perform operations on permu- 
tations of the data other than the native layout, the 
spu_shuff le function is used. This can rearrange 
the data in two input vectors to arbitrary positions 
across two output vectors. This is used extensively 
in the Dslash operation, along with the usual arith- 



metic operations and the fused niultiply-add pro- 
vided by the SPE's FPU). 

The Dslash operation has three main stages: spin 
projection, multiplication by the color matrix, and 
spin reconstruction. Each spin projection and recon- 
struction function has eight variants, correspond- 
ing to forwards and backwards in each dimension. 
Although mathematically each of these variants is 
simply a matrix multiplication by a different ma- 
trix, in practice they are programmatically distinct 
since the matrices in question are constant, small 
and sparse, so the implementation of the multiplica- 
tion is unrolled into the operations required for the 
non-zero elements. The color matrix multiplication 
has two variants, one for multiplying a vector by the 
color matrix and another for multiplying by its ad- 
joint. Each of these 10 operations is hand-translated 
from scalar code in QDP++ to SIMD intrinsics for 
the SPE. 

Optimisation of the SIMD code is facilitated by 
the IBM Assembly Visualizer[?], for the Cell. This 
tool provides a graphical display of the execution of 
compiled code on a SPE, illustrating pipeline stalls 
and data dependencies. The ability to see the char- 
acteristics of the code without running it provides a 
very rapid turnaround and takes much of the guess- 
work out of optimisation. To obtain a more detailed 
view of execution, the IBM Full System Simulator[?] 
may be used to model the SPE with cycle accu- 
racy. The simulator is also useful for fine-grained 
benchmarking of SPE code since the simulated ex- 
ecution environment provides convenient access to 
cycle counters. 

As an example of the optimisation process, the 
forward spin projection phase of the calculation is 
considered. Compiling the reference C implementa- 
tion in QDP++ using spuxlc -03 -qhot and using 
cycle counters in the Cell simulator gives a timing of 
141335 cycles to project 1024 spinors. Rewriting the 
arithmetic as a series of spujnadd and spu_shuf f le 
calls gives a modest runtime decrease to 98291 cy- 
cles. Using register vector float type tempo- 
raries to allow the compiler to reorder loads and 
stores more aggressively reduces the timing further 
to 21084 cycles, or about 21 cycles per spinor. In 
the code generated from the final version, all the 
multiply-add, shuffle, load and store instructions are 
pipelined. 

The SPE has a large register file, with 128 reg- 
isters of 16 bytes each, easily enough to accommo- 
date a color matrix (5 registers), a spinor (6 regis- 
ters) and many temporaries. However, the compiler 



does not always take advantage of this, particularly 
it does not tend to keep values in registers between 
subsequent inlined functions. To obtain the better 
performance this is overcome by fusing functions 
which re-use the same data. Rather than having sep- 
arate projection, matrix multiplication and recon- 
struction functions, the three are combined, such 
that all three operations can share the same regis- 
ter temporaries. This requires 8 fused functions, one 
for each direction. Since the color matrix multiplica- 
tion functions (plain and adjoint) are each common 
to 4 of the fused functions, they are implemented 
as pre-processor macros, operating on register vec- 
tors floats defined in the scope into which they are 
included. Although this is substantially less elegant 
and flexible than having separate functions which 
the compiler inlines, keeping the data in registers 
avoids several spurious loads and stores. 

The SPE's fused multiply-add instruction has the 
same cost as a single add or multiply, so when do- 
ing a multiply an extra add is effectively free, and 
vice- versa. This is exploited in the Dslash procedure 
by implementing mixtures of subtraction and ad- 
dition as multiplications by arrays of ±1 combined 
with an addition. To retain a fair comparison to the 
original code, these extra multiplications by ±1 are 
not counted in the Flop/s. Using this scheme, our 
optimised spin projection function obtains a perfor- 
mance of 1864MFlop/s per SPE, compared to the 
original scalar code's 278MFlop/s. 

The process of optimising for the Cell is utterly 
destructive of the original, producing code whose op- 
eration is difflcult to understand by inspection. As 
such, these Cell-optimised versions must always be 
maintained separately to generic code. This fits into 
the existing structure of QDP++, which already in- 
cludes versions of key functions optimised for the 
SSE architecture, in addition to the generic imple- 
mentation. 



2.5. Extension notes 

This implementation runs only on a single cell 
blade. If the Cell were to be used 'in anger' for Lat- 
tice QCD then the code would have to interleave off- 
node communications with on-Cell calculations and 
communications. The current on-Cell parallelisation 
where by the division between SPEs is in the T di- 
mension could be retained in a multi-node version of 
the code, keeping the T dimension on-node and di- 
minishing requirements for the inter-node intercon- 



nect to a 3D torus. If the dataset was permanently 
distributed across SPEs rather than being scattered 
and gathered at each iteration, it would be natural 
to have the ofF-node communications controlled by 
the SPE threads, rather than involving the PPE. In 
fact, given that the Cell has I/O interfaces directly 
connected to the EIB, it is possible that the net- 
work interface would be directly connected to one 
of these, meaning that the data-flow could pass di- 
rectly from the SPE to the network card, bypassing 
main memory entirely, and ending up directly on an 
SPE on another node. Whether this could be imple- 
mented effectively is a question for those engineering 
the communications hardware and software stack. 

In the current code, communication in the T di- 
mension is implicit in the distribution and collection 
of the fermion (Chi and Psi) data from and to main 
memory at each Dslash call. However, if no further 
manipulations to the data are required in between 
Dslash calls then it would be more eflicient to keep 
this data on the SPEs. In this case, it would be neces- 
sary to implement halo swapping in the T direction 
in between the SPEs. Once this is the case, the PPE 
would be largely uninvolved apart from setting the 
SPEs running for a given number of iterations. In 
this case, it may be possible to save some amount of 
synchronisation runtime by implementing a peer-to- 
peer synchronisation between SPEs while iterating 
the Dslash procedure. A higher performance barrier 
may also be motivated by more frequent synchroni- 
sation, if off-node communications demanded mul- 
tiple synchronisation points during each Dslash pro- 
cedure. 

Transfers larger than 16kB are currently per- 
formed by enqueuing multiple DMA transfers of up 
to 16kB each. Alternatively, one could compose a 
DMA list of such transfers and reduce overheads 
by generating this list at startup and enqueuing it 
at each iteration of the algorithm. This may yield 
superior performance. 

The performance of the Cell implementation is 
such that a Dslash over a (6,4) sub-lattice takes 33ns 
per site, or 28.5ps in total. Using existing bench- 
marks of off-the-shelf interconnects [15^, one can 
compare this on-chip runtime to inter-node commu- 
nications in a multi-node system. Current off-the- 
shelf interconnects offer latencies of the order a few 
ps, while the size of the halo data for a sub-lattice 
this size is of the order tens of kB, which may be 
expected to incur a communications time around 
10-20ps in the best case. This communications time 
is of the order of the serial communications time, 



so for best sustained performance in a multi-node 
system it would be imperative to overlap off-node 
computation and calculation. 

Currently there is a fairly low ceiling on the size 
of sub-lattice which maybe accommodated on each 
SPE. This limit could be made more flexible at the 
cost of increased communications by streaming part 
or all of the U arrays at each iteration rather than 
requiring the whole dataset to fit into the LS at 
one time. Equally, the Psi and Chi data could be 
treated in a streaming manner rather than having 
full-sized local arrays in the LS for the whole sub- 
lattice assigned to the SPE. Some quantity of the 
added communications cost could be hidden using 
double-buffering. This increased flexibility would be 
particularly relevant in a parallel machine which 
may not necessarily have enough processors for each 
Cell to have such as small portion of the lattice as 
is currently required. 



3. Performance 

All benchmarks were run on a QS20 Cell blade, 
with two Cell processors and 1GB of memory for 
each processor. The two processors in the QS20 
blade are configured such that their fabrics are com- 
bined to give seamless access to all SPEs from one 
PPE program. Benchmarks are run on a varying 
number of SPEs, denoted by iVgpE, from 2 to 16 
(recall that each Cell processor has 8 SPEs). Each 
timing result is an average over 3 runs. Where the 
system size is varied, the spatial size is denoted L 
and the time dimension T, such that there are in 
total L^T sites. As a shorthand, system sizes are 
written (L, T). 

Performing typical strong scaling benchmarks is 
complicated by the limitation on the data size which 
can be accommodated in the local stores. A system 
size which will can be run on two SPEs is some- 
what inadequate to test the performance of 16. For 
this reason a variety of system sizes were used, some 
providing coverage of low numbers of SPEs, some 
testing larger numbers of SPEs. In addition, weak 
scaling was measured using a system size with L = 
2 and T proportional to the number of SPEs used. 
This provides a direct test of the synchronisation 
and communications, since the amount of serial work 
per SPE remains constant. 

The results of these benchmarks are shown in Fig. 
[l] Each series shows a different system size, including 
the weak scaling case in which the system size is 
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Figure 1. Performance of Cell optimised Wilson-Dslash 
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Figure 2. Weak scaling of Cell optimised Wilson-Dslash 

proportional to the number of SPEs. In the following 
discussion, system sizes are expressed as (L,T) for 
brevity. 

The weak scaling case is shown in more detail in 
Fig. |2] where it is compared with a straight line be- 
tween the NspE = 2 point and the origin. There is 
close to linear scaling up to 8 SPEs, which falls away 
for greater numbers of SPEs. This corresponds to the 
regime in which some of the SPEs are on a separate 
chip, with a connection between the two EIBs pro- 
vided over a 25.6GB/s I/O port. The almost-linear 
weak scaling plot indicates that the performance is 
limited largely by serial performance. 

The (6,4) system is shown in Fig. fl] as a sin- 
gle datapoint, as it runs only on exactly 8 SPEs: 
a smaller number of SPEs would not provide suffi- 
cient LS capacity, and to use more would require a 
larger size in the time dimension in order to provide 
each SPE with at least one 3D slice. This system 
is of particular interest in the context of a multi- 
node system, along with the (4, 24) system from the 



weak scaling series (the (4, 24) system is run on 8 
SPEs). The (4, 24) system gives higher performance, 
but it is quite limited in this whole-lattice time di- 
mensions possible: only systems with multiple-of-24 
time lengths would be possible. The (6,4) system 
uses only around half of the available LS, but would 
be much more flexible in terms of simulations on 
different Tsizes across multiple nodes. A good com- 
promise might be possible if some quantity of the 
color matrix arrays were streamed from main mem- 
ory at each iteration, slightly decreasing the local 
store space requirement and allowing a (6,8) sys- 
tem on one Cell processor, at the cost of increased 
main-memory I/O. 

Smaller systems which can be run on smaller num- 
bers of SPEs are also shown on Fig.[T] These scale to 
smaller numbers of SPEs before performance starts 
to degrade. The only system included which can be 
run on the whole range of SPEs is the (2,48) size, 
which is small enough to run on 2 SPEs, but suff- 
ciently long in the time dimension to run on 16 SPEs. 
Performance of this system falls off from 6 SPEs, 
although still reaches 30GFlop/s on one Cell pro- 
cessor (8 SPEs), which remains impressive in com- 
parison with conventional processors. The (4, 8) sys- 
tem achieves close to 40GFlop/s on 8 SPEs, making 
it a reasonably efficient choice of system size which 
could be useful in multi-node system for construct- 
ing power-of-two-sized global lattices. 

To compare cell performance to a conventional 
processor, a (6,4) lattice is used to allow the con- 
ventional processor to work in L2 cache (this lat- 
tice is size of 145kB(l spinor per site, 4 color matri- 
ces per site). A larger lattice would cause the con- 
ventional platforms to work out of main memory, 
while the Cell version would simply not be capa- 
ble of it. However, if the Cell version were adapted 
to stream parts of the dataset out of main mem- 
ory then it would probably remain competitive with 
conventional platforms due to the Cell's relatively 
high memory bandwidth. 

As an example of a conventional x86 proces- 
sor, a 2.4GHz Xeon system was used, along with 
a high performance SSE3 implementation of the 
Dslash operation [lljl2j encapsulated in a library 
called inteLsse_wilson_dslash. This is included with 
the Chroma library and optionally compiled in on 
systems which support SSE. This provides an im- 
pressive performance of 4.5GFlop/s, demonstrating 
the high performance possible with this algorithm 
using low- level SIMD programming. However, the 
single precision floating point performance of the 



Cell implementation presented here is almost 10 
times greater than even the SSE-optimised version 
on the Xeon. 

4. Conclusion 

The code described here obtained up to 45GFlop/s 
performance in single precision, providing a proof- 
of-concept for running Lattice QCD simulations on 
the Cell processor. The optimisation for Cell was 
accomplished entirely within the C environment 
provided by the Cell SDK, although the extensive 
use of SIMD intrinsics led to less readable code. 
Extensions to this work could include a multi-node 
parallel version of this code, and a double preci- 
sion version, which could be expected to achieve 
20GFlop/s on the forthcoming double-precision 
Cell processor. However, the system sizes shown 
here would need to be revised if the Local Store 
on the SPUs was not increased accordingly for a 
double-precision Cell. 

The Cell processor remains challenging to the sci- 
entific application programmer, demanding signif- 
icant platform-specific knowledge to obtain good 
performance. However, for applications concentrat- 
ing work in a single kernel ~ as Lattice QCD codes 
do in Dslash ~ the effort involved in producing a 
Cell-optimised version of the kernel can render an 
order-of-magnitude performance improvement over 
conventional platforms. 
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